Abstract. Social circles detection is a special case of community detection
in social networks that is currently attracting growing interest in
the research community. In this paper we present an empirical evaluation
of the multi-assignment clustering method using different feature
representation models. We define different vectorial representations from
both structural egonet information and user profile features. We study
and compare the performance on the available labelled Facebook data
from the Kaggle competition on learning social circles in networks. We
compare our results with several different baselines.

Keywords: social circles detection, community detection, feature representations


1  Introduction 

Nowadays, users in social networks tend to organize the contacts in their personal
networks by means of social circles, a tool already implemented by the major
companies, for instance as Facebook lists or Google+ circles. However, this
labelling is still mostly done manually, and therefore growing interest has arisen
in the automatic detection of these circles. This problem is related
to the more general task of community detection in graphs, that is, the identification
of subnetworks in a given network. The main difference between the two problems
is that social circles detection uses information from users' profiles in addition
to information from the network structure itself.

Despite the lack of a precise and well-accepted definition of community, there
is a wide variety of methods and techniques designed to cope with community
detection [3, 10]. Moreover, some techniques specifically designed for social
circles detection are currently being developed [6, 7]. In this article, we present our
approach based on multi-assignment clustering (MAC) [13, 4], originally a
clustering technique for Boolean vectorial data not necessarily related to networks
or graphs. The advantage of this technique is that it can assign the same
object to several different clusters, which here correspond to different social
circles. MAC has already been tried for social circles detection [6, 7], but only
with a very simple feature representation that considers user profile features
and ignores the network structure. In our work we propose novel approaches that
consider different representations of both network structure and user profile features.

The rest of the paper is structured as follows. In section two, we present
previous work on community detection and social circles detection. In section
three, we describe our methodology in detail, including the different data
representations proposed and the baseline methods to compare with. In section
four, we present the dataset and the evaluation measure of our experiments. In
section five, we discuss the results obtained. Finally, we draw some conclusions.

2  Previous  Work 

2.1  Community  Detection  in  Networks 

From an abstract point of view, a network is equivalent to a graph, defined by
a set of nodes connected by edges. Nevertheless, for researchers working in
different fields, the concept of network has additional
connotations. Networks can represent real structures such as social networks,
biological networks (neural synaptic networks, metabolic networks), technological
networks (the Internet, the World Wide Web), logistic networks (distribution
networks), etc. There is no well-accepted formal definition of community in
general networks. However, there is a consensus that a community consists of a
group of nodes that are more densely connected to each other than to the nodes
outside. Membership in a community usually carries an extra meaning,
and the vertices in a community will probably share common properties or
play similar roles within the graph.

Community detection is the task of automatically identifying the communities
of a network. A considerable number of methods have been developed to
solve this problem [3, 10].

In real networks nodes are often shared among different communities. The
most popular technique to detect overlapping communities is the clique
percolation method [8]. Given a graph, a k-clique is defined as a complete subgraph of
size k. Clique percolation consists in the identification of k-clique communities,
defined as the union of all k-cliques that can be reached from each other through
a series of adjacent k-cliques. Despite the good performance of this technique,
clique percolation remains a hard computational problem: new and improved
implementations still scale worse than some other overlapping community finding
algorithms.
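
To make the definition of k-clique communities concrete, the following minimal Python sketch enumerates all k-cliques of a tiny graph by brute force and merges those that are adjacent. It assumes the usual convention that two k-cliques are adjacent when they share k-1 nodes; the function name and the toy graph are illustrative only, and the brute-force enumeration is not meant to scale.

```python
from itertools import combinations


def k_clique_communities(nodes, edges, k):
    """Return the k-clique communities of a small undirected graph."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    # Brute-force enumeration of all k-cliques (only viable for tiny graphs).
    cliques = [
        frozenset(c)
        for c in combinations(sorted(nodes), k)
        if all(v in adj[u] for u, v in combinations(c, 2))
    ]

    # Union-find over cliques; assumed adjacency: sharing k-1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # A community is the union of all cliques in one connected group.
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return [frozenset(g) for g in groups.values()]


# Toy graph: triangles {1,2,3} and {2,3,4} are adjacent; {4,5,6} is not.
toy_edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]
communities = k_clique_communities(range(1, 7), toy_edges, k=3)
```

In this toy graph node 4 belongs to both resulting communities, which illustrates how the method produces overlapping groups.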


2.2  Application  to  Social  Networks  and  Social  Circles  Detection 

The study of social networks is a research topic with a history of decades, and
it has recently been revitalized by the appearance of new information and
communication technologies which have opened new ways of interacting. Several
procedures have been designed for clustering this social content. Some
approaches base the clustering on the network links [10], while others consider
the semantic content of social interactions [15]. Between these two methodologies,
there has also been work on combining the links and the content for
clustering [9, 12]. Very recently, a new technique studied the characteristics of
community structures formed around topical discussion clusters, using
modularity maximization algorithms [2].

Social circles detection is a special case of this framework. Within a social
network, an ego network or egonet is defined as the subgraph of the contacts of a
particular user (called the ego). Thus, it includes all the contacts of the ego and
the contact relationship between every pair of them. The social circles of an
ego can then be considered as clusters of the egonet. Social circles may overlap (share
nodes), for example university friends who were high school friends as well, and
they may also present hierarchical inclusion (the nodes of a circle totally included
in another), for example university friends within a generic friends category.
Apart from the links of the egonet, user profile information is also normally
considered in this task. The latest works on social circles detection define a
generative model that considers circle memberships and a circle-specific profile
similarity metric [6, 7].


3  Methodology 

3.1 Multi-Assignment Clustering

Multi-Assignment Clustering (MAC) [13, 4] is a clustering method, originally
developed for Boolean vectorial data, which allows the same object to be
assigned to several different clusters. It provides a decomposition of the
data matrix x into a matrix z containing the cluster prototypes and a matrix y
representing the degree to which a particular data vector belongs to the different
clusters. Finding optimal matrices z and y is NP-hard [14], but a probabilistic
representation drastically simplifies the optimization problem. In [13]
the authors propose to model the probability of x_{ij} under the signal model as:


P(x_{ij} | z, \beta) = ...,  where \beta_{kj} := p(y_{kj} = 0)    (1)


In addition to the signal model, there is a noise model for the difference between
the original data and the reconstruction made from z and y. The model
parameters are inferred by deterministic annealing [11, 1]. When MAC is applied
to social circles detection, y is the matrix indicating which users belong to the
different clusters, that is, the social circles.
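
As a rough illustration of the decomposition idea only (not of the probabilistic inference), the sketch below rebuilds a Boolean data matrix from a membership matrix y and a prototype matrix z with an OR-AND product, and counts the entries that the noise model would have to explain. The shapes (y as users by clusters, z as clusters by features), the function names and the toy matrices are assumptions made for this example.

```python
def boolean_product(y, z):
    """OR-AND product: entry (i, j) is 1 if user i belongs to at least one
    cluster k whose prototype z[k] has feature j switched on."""
    n_clusters = len(z)
    n_features = len(z[0])
    return [
        [int(any(row[k] and z[k][j] for k in range(n_clusters)))
         for j in range(n_features)]
        for row in y
    ]


def mismatches(x, x_hat):
    """Number of entries where the data and the reconstruction differ."""
    return sum(a != b for rx, rh in zip(x, x_hat) for a, b in zip(rx, rh))


# Two hypothetical cluster prototypes over four Boolean features.
z = [[1, 1, 0, 0],
     [0, 0, 1, 1]]
# Memberships of four users; the third user belongs to both clusters,
# the fourth to none.
y = [[1, 0],
     [0, 1],
     [1, 1],
     [0, 0]]
# Observed data; the last feature of the third user deviates from the model.
x = [[1, 1, 0, 0],
     [0, 0, 1, 1],
     [1, 1, 1, 0],
     [0, 0, 0, 0]]

x_hat = boolean_product(y, z)
noise_bits = mismatches(x, x_hat)
```

A user assigned to several clusters is reconstructed as the disjunction of their prototypes, which is what allows one user to belong to several circles at once.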

In [6, 7] MAC was already employed as a baseline method for
social circles detection, although using only user profile information. The fact
that MAC appears in recent and influential publications as a state-of-the-art
technique led us to choose it over alternative soft-clustering
strategies. In this work, we explore its possibilities for this
task further by investigating novel representations. We argue that this technique
still has potential and that better results can be obtained. Furthermore, MAC is
better suited to large networks than methods with a very high computational
cost, such as clique percolation.

As a novelty, we model the structural information of the egonets in diverse
vectorial representations ready to be supplied to the algorithm. Several
vectorial representations for user profile features were developed as well. Unlike the
original MAC, we allow the input to be real-valued data in [0, 1]^n as a way to model a
hierarchy of link levels in the case of structural information, or an aggregation
of the number of feature values shared by two user profiles in the case of user
profile information.

In all the experiments, the input data matrix x is a horizontal concatenation
of a matrix s, containing structural network information, and a matrix p,
containing profile feature information: x = [s | p]. Rows represent users of the
egonet, and therefore for every user u there is a row vector of structural network
information, s_u, and a row vector of profile feature information, p_u. Therefore,
the number of rows of the matrix x is the number of users in the ego-network,
|u|, and the number of columns of the matrix x is the total number of features
used to represent structural and profile information of each user.
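
The construction x = [s | p] can be sketched in a few lines of Python. The helper name and the toy matrices are hypothetical; the only point is that each user's row of x is the row s_u followed by the row p_u.

```python
def hconcat(s, p):
    """Horizontally concatenate two matrices given as lists of rows."""
    if len(s) != len(p):
        raise ValueError("s and p must have one row per user")
    return [list(s_u) + list(p_u) for s_u, p_u in zip(s, p)]


# Toy egonet with 3 users: 3 structural columns and 2 profile columns.
s = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
p = [[1, 0],
     [1, 1],
     [0, 1]]
x = hconcat(s, p)
```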

3.2  Structural  network  representation 

In this subsection, we present the different representations of the structural
network information that have been considered. All of them transform graph links
into the matrices s. We use the following concepts:

- Friendship ranks: when there is a link between two users, we say they are
direct friends or rank 1 friends. When two users are not direct friends but
have a common direct friend, we say they are rank 2 friends. Friendship
ranks of higher levels can be defined analogously. In this study we consider up
to rank 3 friends. There is a column in s for every friendship rank and user
in the egonet. An element of s is 1 if the row user and the column user are
friends of that rank, and 0 otherwise. This gives a total of 3 × |u| structural
features for each user.

- Weighting: the data are weighted depending on the friendship rank they
represent. Rank 1 friendship keeps the value 1, whereas rank 2 friendship is weighted
by 0.5 and rank 3 friendship by 0.25. As in the previous case,
this gives a total of 3 × |u| structural features for each user.

- Aggregation: for every user, the different friendship ranks are aggregated
into just one value, obtained as the maximum weighted
friendship rank. This reduces the number of structural features to |u|.
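
The three concepts above can be sketched in Python as follows. The sketch assumes that the friendship rank of two users equals their shortest-path distance in the egonet (which matches the definitions of ranks 1 and 2 and is our reading of rank 3); all function and variable names are illustrative.

```python
from collections import deque

# Weights per friendship rank, as described in the text.
WEIGHTS = {1: 1.0, 2: 0.5, 3: 0.25}


def friendship_ranks(users, edges, max_rank=3):
    """ranks[u][v] = friendship rank of u and v (assumed to be the
    shortest-path distance), stored only when it is between 1 and max_rank."""
    adj = {u: set() for u in users}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    ranks = {}
    for src in users:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            if dist[cur] == max_rank:
                continue  # do not explore beyond the largest rank
            for nxt in adj[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        ranks[src] = {v: d for v, d in dist.items() if d > 0}
    return ranks


def structural_row(u, users, ranks, max_rank, weighted=False, aggregated=False):
    """Row s_u: one block of |u| columns per rank, or one aggregated block."""
    blocks = []
    for r in range(1, max_rank + 1):
        value = WEIGHTS[r] if (weighted or aggregated) else 1.0
        blocks.append([value if ranks[u].get(v) == r else 0.0 for v in users])
    if aggregated:
        # Maximum weighted value over the ranks, one column per user.
        return [max(column) for column in zip(*blocks)]
    return [value for block in blocks for value in block]


# Toy egonet: a path a - b - c - d.
users = ["a", "b", "c", "d"]
ranks = friendship_ranks(users, [("a", "b"), ("b", "c"), ("c", "d")])
row_plain = structural_row("a", users, ranks, max_rank=2)
row_weighted = structural_row("a", users, ranks, max_rank=2, weighted=True)
row_aggregated = structural_row("a", users, ranks, max_rank=3, aggregated=True)
```

For user a, the plain and weighted rows have 2 × |u| = 8 columns because only two ranks are requested, while the aggregated row has |u| = 4 columns.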

From these concepts we define the representations shown in Table 1.

Table 1. Representations of structural network information

Representation   Definition
r1               Rank 1
r12              Ranks 1 and 2
r123             Ranks 1, 2 and 3
r12w             Ranks 1 and 2, weighted
r123w            Ranks 1, 2 and 3, weighted
r12a             Ranks 1 and 2, aggregated
r123a            Ranks 1, 2 and 3, aggregated

3.3 User profile representation

There are up to 57 profile features for every user in the data corpus we used for
the experiments. Nevertheless, some of them are very rarely filled in, whereas
others are redundant or not relevant for the task. As a consequence, we have
selected the 3 most informative features and we use only these. The selected
features are hometown, schools and employers. Each of these features can take
different discrete values from a finite set.

We define as |f| the number of features considered, and as |v| the total
number of values of the considered features that are taken by at least one user
in the egonet. We encode the profile feature information in the matrices p, for
which the following representations have been defined:

- Explicit: There is a column of p for every different value of the considered
features. An element of p is 1 if the row user takes the column value for the
respective feature, and 0 otherwise. This yields |v| profile features
for each user.

- Intersection: There is one column of p for every user in the egonet and every
considered profile feature. An element of p is 1 if the sets of values of the
row user and the column user, for that particular feature, intersect. It is 0
otherwise. This yields |f| × |u| profile features for each user.

- Weighted: There is just one column of p for every user in the egonet. An
element of p represents the proportion of features for which the row user and
the column user share at least one value. It is calculated as |s| / |f|, where |s|
is the number of features shared between both users. This reduces the number
of profile features to |u|.
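
The three profile representations can be sketched in Python as below. Profiles are assumed to be stored as a dictionary from user to feature to a set of values; this data layout, the helper names and the toy values (h1, s1, e1, ...) are assumptions for illustration. The sketch does not treat the diagonal (a user compared with itself) specially, since the text does not specify it.

```python
FEATURES = ["hometown", "schools", "employers"]


def shares(profiles, u, w, f):
    """True if users u and w have at least one common value for feature f."""
    return bool(profiles[u].get(f, set()) & profiles[w].get(f, set()))


def explicit(users, profiles, features=FEATURES):
    """One column per (feature, value) pair taken by at least one user."""
    columns = sorted({(f, v) for u in users for f in features
                      for v in profiles[u].get(f, set())})
    rows = [[int(v in profiles[u].get(f, set())) for f, v in columns]
            for u in users]
    return rows, columns


def intersection(users, profiles, features=FEATURES):
    """One column per (feature, user): 1 if the value sets intersect."""
    return [[int(shares(profiles, u, w, f)) for f in features for w in users]
            for u in users]


def weighted(users, profiles, features=FEATURES):
    """One column per user: proportion |s| / |f| of shared features."""
    return [[sum(shares(profiles, u, w, f) for f in features) / len(features)
             for w in users]
            for u in users]


# Toy profiles; u2 has no employers informed.
profiles = {
    "u1": {"hometown": {"h1"}, "schools": {"s1"}, "employers": {"e1"}},
    "u2": {"hometown": {"h1"}, "schools": {"s1", "s2"}},
    "u3": {"hometown": {"h2"}, "schools": {"s2"}, "employers": {"e1"}},
}
users = ["u1", "u2", "u3"]
p_explicit, value_columns = explicit(users, profiles)
p_intersection = intersection(users, profiles)
p_weighted = weighted(users, profiles)
```

With these toy profiles, |v| = 5, so the explicit matrix has 5 columns, the intersection matrix has |f| × |u| = 9 columns, and the weighted matrix has |u| = 3 columns.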

4  Experiments 

The corpus we use for the experiments is the one published for the Kaggle
competition on learning social circles in networks [5]. The data consist of
hand-labelled friendship egonets from Facebook and a set of 57 profile features for
every node in those networks. We discarded every egonet for which the ground
truth is not available. Out of the 60 egonets we finally considered, the smallest
one contains 45 users and the largest one contains 670 users. The 60 egonets
altogether comprise 14,519 users.


The degree of a given user is defined as the number of different circles to which
it belongs. MAC takes as a parameter the range of possible degrees of the
users of an egonet. In all our experiments the minimum degree is set to 0 and
we try several values for the maximum degree, up to 3. In this regard, unlike
previous studies, we do not include any prediction technique for the number
of circles within the egonets, using the number of circles of the ground truth
instead. In future work, such a prediction could easily be incorporated with
methods such as the Bayesian information criterion employed in [6, 7].

The evaluation measure of our experiments, which is the one proposed in Kaggle,
is calculated as follows.

An evaluation measure for every egonet e is computed as an edit distance between
the ground truth circles (g_e) and the predicted circles (p_e): EDM_e = d(g_e, p_e).
Four basic edit operations are considered: adding a user to an existing circle,
creating a circle with one user, removing a user from a circle and deleting a
circle with one user; each of them has cost 1.

The evaluation measure of the whole dataset is the sum of the edit distances
obtained for all the egonets:

EDM = \sum_{e \in E} EDM_e,    (2)

where E is the set of egonets in the corpus.

The smaller EDM is, the better the performance of the prediction.
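
One possible way to compute this measure is sketched below in Python. It assumes that circles are unlabelled, so predicted and ground truth circles are first aligned one to one (padding the shorter list with empty circles), and that the cost of turning one circle into another with the four unit-cost operations equals the size of their symmetric difference. The alignment is found by brute force over permutations, which is only practical for a handful of circles; an assignment solver such as the Hungarian algorithm would be needed at realistic sizes. Function names and toy data are illustrative.

```python
from itertools import permutations


def circle_edit_distance(truth, predicted):
    """EDM_e: minimum number of unit-cost edits turning the predicted
    circles into the ground truth circles (brute-force alignment)."""
    truth = [frozenset(c) for c in truth]
    predicted = [frozenset(c) for c in predicted]
    n = max(len(truth), len(predicted))
    # Pad with empty circles so that unmatched circles cost their own size.
    truth += [frozenset()] * (n - len(truth))
    predicted += [frozenset()] * (n - len(predicted))
    return min(
        sum(len(g ^ p) for g, p in zip(truth, perm))
        for perm in permutations(predicted)
    )


def edm(egonets):
    """Sum of the per-egonet edit distances, as in equation (2)."""
    return sum(circle_edit_distance(g, p) for g, p in egonets)


# Toy egonet: two ground truth circles and two alternative predictions.
truth = [{1, 2, 3}, {3, 4}]
predicted = [{1, 2}, {3, 4, 5}, {6}]
d_predicted = circle_edit_distance(truth, predicted)
d_empty = circle_edit_distance(truth, [])
total = edm([(truth, predicted), (truth, [])])
```

Under these assumptions, predicting no circles at all costs exactly the total number of circle memberships in the ground truth.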

5  Results 

We compare our results to several different baselines. First of all, we consider
MAC when it receives only structural information, using an r1 representation.
Second, we consider MAC with only profile features, for which we use an explicit
representation of the features. These two baselines are intended to show to what degree
the combination of structural network and profile information improves on either
of these sources of information when taken independently.

Empty circles is the third baseline we employ. This baseline relies on the fact
that the evaluation measure used in this study heavily penalizes the
misclassification of users into circles. Thus, defining no circle at all performs better than
other possible simple baselines like connected components or classifying all the
friends of an ego into just one circle.

Finally, we have considered a very high-performing baseline by using a
5-clique percolation algorithm. However, this cannot be done for every egonet due
to its exponential computational complexity. Therefore, we replace the clique
percolation predictions by empty circles in those cases.

It would also be interesting to report the results of the participants of the Kaggle
competition from which we borrowed the data. Unfortunately, rankings are
publicly available only for the test dataset, for which the ground truth
is not available. Thus, this comparison is not possible.


The evaluation measures obtained by the baselines and our experiments are
shown in Table 2. Only results obtained from weighted and aggregated structural
egonet representations are presented, as the unweighted representations always
performed worse.


Table 2. Baselines and results of the experiments

Baseline               EDM ↓
MAC only structure     18679
MAC only profile       20271
Empty circles          17101
Clique percolation     15350

Data representation               EDM ↓
Structural   Profile        deg. 1   deg. 2   deg. 3
r12w         Explicit        16962    16803    16827
r12w         Intersection    16032    16360    15927
r12w         Weighted        17001    16955    16920
r123w        Explicit        17106    17053    17082
r123w        Intersection    16520    16504    16518
r123w        Weighted        16994    17065    17075
r12a         Explicit        15797    15619    15570
r12a         Intersection    16433    16840    16694
r12a         Weighted        15725    15751    15625
r123a        Explicit        16804    16770    16703
r123a        Intersection    16960    17000    17383
r123a        Weighted        16634    16542    16558

The best results were produced when considering friendship of ranks 1
and 2, aggregated, together with an explicit representation of the profile feature
information, allowing MAC a maximum degree of 3. This representation provided
a value of EDM close to that obtained from the clique percolation baseline. All
the experiments using the structural network representation r12a gave low
values of EDM, outperforming the empty circles baseline in all cases and
most of the other representations as well. The combination of (weighted)
structural network information and profile features always performed better than
structure or profile separately.

6  Conclusions 

Network structure and profile features are complementary sources of information
for social circles detection. In addition, weighting of structural network
information with respect to friendship levels is crucial to improve the results and get close
to the ones provided by methods such as clique percolation. This work opens
the door to new research on the topic. Possible future experiments include the use
of a larger set of profile features or better retrieved ones, the adoption of
other prediction techniques, or even a more in-depth study of MAC.